Building Logistic Regression & Decision Tree models that will help the marketing department identify potential customers who have a higher probability of purchasing a personal loan.
AllLife Bank is a US bank that has a growing customer base. The majority of these customers are liability customers (depositors) with varying sizes of deposits. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and in the process, earn more through the interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).
A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9%. This success has encouraged the retail marketing department to devise campaigns with better target marketing to increase the success ratio.
You, as a Data Scientist at AllLife Bank, have to build a model that will help the marketing department identify potential customers who have a higher probability of purchasing the loan.
Which segment of customers should be targeted more?
Build Logistic Regression & Decision Tree models that will help the marketing department identify potential customers who have a higher probability of purchasing the loan.
Read the given data into a data frame and understand its nature: the given features, the total number of records, and whether the data has any missing values, duplicates, or outliers.
Visualize the data to understand its range and outliers.
# this will help in making the Python code more structured automatically
%load_ext nb_black
import warnings
warnings.filterwarnings("ignore")
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
Read the given CSV file Loan_Modelling.csv and load it into the data frame `data`.
# reading the loan data provided by the bank into a data frame
loan = pd.read_csv("Loan_Modelling.csv")
# copying the original data so that we don't lose it when making changes
data = loan.copy()
data.head(5)
| ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 25 | 1 | 49 | 91107 | 4 | 1.6 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 2 | 45 | 19 | 34 | 90089 | 3 | 1.5 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 3 | 39 | 15 | 11 | 94720 | 1 | 1.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4 | 35 | 9 | 100 | 94112 | 1 | 2.7 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | 35 | 8 | 45 | 91330 | 4 | 1.0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 |
data.tail(5)
| ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4995 | 4996 | 29 | 3 | 40 | 92697 | 1 | 1.9 | 3 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4996 | 4997 | 30 | 4 | 15 | 92037 | 4 | 0.4 | 1 | 85 | 0 | 0 | 0 | 1 | 0 |
| 4997 | 4998 | 63 | 39 | 24 | 93023 | 2 | 0.3 | 3 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4998 | 4999 | 65 | 40 | 49 | 90034 | 3 | 0.5 | 2 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4999 | 5000 | 28 | 4 | 83 | 92612 | 3 | 0.8 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
data.shape
(5000, 14)
Observations on the data
Checking the data types of all columns
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   ID                  5000 non-null   int64  
 1   Age                 5000 non-null   int64  
 2   Experience          5000 non-null   int64  
 3   Income              5000 non-null   int64  
 4   ZIPCode             5000 non-null   int64  
 5   Family              5000 non-null   int64  
 6   CCAvg               5000 non-null   float64
 7   Education           5000 non-null   int64  
 8   Mortgage            5000 non-null   int64  
 9   Personal_Loan       5000 non-null   int64  
 10  Securities_Account  5000 non-null   int64  
 11  CD_Account          5000 non-null   int64  
 12  Online              5000 non-null   int64  
 13  CreditCard          5000 non-null   int64  
dtypes: float64(1), int64(13)
memory usage: 547.0 KB
Observations on the data
data.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| ID | 5000.0 | 2500.500000 | 1443.520003 | 1.0 | 1250.75 | 2500.5 | 3750.25 | 5000.0 |
| Age | 5000.0 | 45.338400 | 11.463166 | 23.0 | 35.00 | 45.0 | 55.00 | 67.0 |
| Experience | 5000.0 | 20.104600 | 11.467954 | -3.0 | 10.00 | 20.0 | 30.00 | 43.0 |
| Income | 5000.0 | 73.774200 | 46.033729 | 8.0 | 39.00 | 64.0 | 98.00 | 224.0 |
| ZIPCode | 5000.0 | 93169.257000 | 1759.455086 | 90005.0 | 91911.00 | 93437.0 | 94608.00 | 96651.0 |
| Family | 5000.0 | 2.396400 | 1.147663 | 1.0 | 1.00 | 2.0 | 3.00 | 4.0 |
| CCAvg | 5000.0 | 1.937938 | 1.747659 | 0.0 | 0.70 | 1.5 | 2.50 | 10.0 |
| Education | 5000.0 | 1.881000 | 0.839869 | 1.0 | 1.00 | 2.0 | 3.00 | 3.0 |
| Mortgage | 5000.0 | 56.498800 | 101.713802 | 0.0 | 0.00 | 0.0 | 101.00 | 635.0 |
| Personal_Loan | 5000.0 | 0.096000 | 0.294621 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| Securities_Account | 5000.0 | 0.104400 | 0.305809 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| CD_Account | 5000.0 | 0.060400 | 0.238250 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| Online | 5000.0 | 0.596800 | 0.490589 | 0.0 | 0.00 | 1.0 | 1.00 | 1.0 |
| CreditCard | 5000.0 | 0.294000 | 0.455637 | 0.0 | 0.00 | 0.0 | 1.00 | 1.0 |
Observations on the data
- ID: as noted earlier, this is just the row number; the column will be dropped later.
- Age: appears evenly distributed, with min 23, max 67, and quartiles at 35 (25%), 45 (50%), and 55 (75%); no apparent outliers.
- Experience: has some negative values that we must check in the preprocessing step; otherwise it appears normally distributed.
- Income: has some outliers, or the data needs normalization; the gap between the 75th percentile and the max is very high. We will fix this in later stages.
- ZIPCode: we may need to convert this into city/state/county or metropolitan data; we cannot use the raw ZIP code directly.
- Family: appears evenly distributed.
- CCAvg: comparing the 75th percentile with the max, this may have outliers and require transformation in later stages.
- Education: appears evenly distributed.
- Mortgage: has outliers, since the gap between the 75th percentile and the max is very high; we will normalize this in later stages.
- Personal_Loan, Securities_Account, CD_Account, Online, CreditCard: boolean indicators; we will look at their spread in the next section.

Let's check which columns have null values, and how many.
# Prints total null value count(s) for all columns in input data frame
def print_null_info(df):
"""
Prints total null value count(s) for all columns in input data frame
"""
print("\nTotal Null value counts\n")
print(df.isnull().sum().sort_values(ascending=False))
print_null_info(data)
Total Null value counts

ID                    0
Age                   0
Experience            0
Income                0
ZIPCode               0
Family                0
CCAvg                 0
Education             0
Mortgage              0
Personal_Loan         0
Securities_Account    0
CD_Account            0
Online                0
CreditCard            0
dtype: int64
# check for any duplicate data
data[data.duplicated()].shape
(0, 14)
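Note that because ID is unique for every row, `data.duplicated()` can never flag a repeat; dropping ID first gives a stricter check on the remaining attributes. A minimal sketch of the idea on a toy frame (in the notebook, the real check would be `data.drop(columns="ID").duplicated()`):

```python
import pandas as pd

# With a unique ID column, duplicated() on the full frame is always all-False.
# Dropping ID first checks whether the *attributes* repeat.
df = pd.DataFrame({"ID": [1, 2, 3], "Age": [25, 25, 40], "Income": [49, 49, 80]})
print(df.duplicated().sum())                     # 0 - unique IDs mask repeats
print(df.drop(columns="ID").duplicated().sum())  # 1 - rows 0 and 1 match
```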
# Prints unique value counts and the top 10 values & counts for the given columns of the input data frame
def print_category_value_counts(df, column_names):
"""
Prints unique value counts and the top 10 values & counts for the given columns of the input data frame
"""
print()
for col in column_names:
print()
print(f"Column name : {col} has total {df[col].nunique()} unique values")
print()
print(df[col].value_counts()[0:10])
print()
print("-" * 50)
# print value types and value counts
cols = [
"Age",
"Experience",
"Income",
"ZIPCode",
"Family",
"CCAvg",
"Education",
"Mortgage",
"Personal_Loan",
"Securities_Account",
"CD_Account",
"Online",
"CreditCard",
]
print_category_value_counts(data, cols)
Column name : Age has total 45 unique values 35 151 43 149 52 145 58 143 54 143 50 138 41 136 30 136 56 135 34 134 Name: Age, dtype: int64 -------------------------------------------------- Column name : Experience has total 47 unique values 32 154 20 148 9 147 5 146 23 144 35 143 25 142 28 138 18 137 19 135 Name: Experience, dtype: int64 -------------------------------------------------- Column name : Income has total 162 unique values 44 85 38 84 81 83 41 82 39 81 40 78 42 77 83 74 43 70 45 69 Name: Income, dtype: int64 -------------------------------------------------- Column name : ZIPCode has total 467 unique values 94720 169 94305 127 95616 116 90095 71 93106 57 92037 54 93943 54 91320 53 91711 52 94025 52 Name: ZIPCode, dtype: int64 -------------------------------------------------- Column name : Family has total 4 unique values 1 1472 2 1296 4 1222 3 1010 Name: Family, dtype: int64 -------------------------------------------------- Column name : CCAvg has total 108 unique values 0.00 106 1.90 106 0.90 106 1.60 101 2.10 100 2.40 92 2.60 87 1.10 84 1.20 66 2.30 58 2.70 58 2.90 54 3.00 53 3.30 45 3.80 43 3.40 39 2.67 36 4.00 33 4.50 29 3.60 27 3.90 27 4.30 26 6.00 26 3.70 25 4.70 24 3.20 22 4.10 22 4.90 22 3.10 20 0.67 18 1.67 18 5.00 18 2.33 18 5.40 18 6.50 18 4.40 17 5.20 16 3.50 15 4.60 14 6.10 14 6.90 14 7.00 14 7.20 13 7.40 13 5.70 13 6.30 13 8.00 12 7.50 12 4.20 11 6.33 10 8.10 10 7.30 10 6.80 10 8.80 9 6.70 9 6.67 9 7.60 9 7.80 9 4.33 9 1.75 9 0.75 9 1.33 9 8.60 8 5.60 7 4.80 7 5.10 6 5.90 5 7.90 4 5.50 4 6.60 4 5.30 4 5.80 3 6.40 3 10.00 3 Name: CCAvg, dtype: int64 -------------------------------------------------- Column name : Education has total 3 unique values 1 2096 3 1501 2 1403 Name: Education, dtype: int64 -------------------------------------------------- Column name : Mortgage has total 347 unique values 0 3462 98 17 89 16 91 16 83 16 119 16 103 16 90 15 102 15 78 15 Name: Mortgage, dtype: int64 
-------------------------------------------------- Column name : Personal_Loan has total 2 unique values 0 4520 1 480 Name: Personal_Loan, dtype: int64 -------------------------------------------------- Column name : Securities_Account has total 2 unique values 0 4478 1 522 Name: Securities_Account, dtype: int64 -------------------------------------------------- Column name : CD_Account has total 2 unique values 0 4698 1 302 Name: CD_Account, dtype: int64 -------------------------------------------------- Column name : Online has total 2 unique values 1 2984 0 2016 Name: Online, dtype: int64 -------------------------------------------------- Column name : CreditCard has total 2 unique values 0 3530 1 1470 Name: CreditCard, dtype: int64 --------------------------------------------------
- Age: 45 unique values; we can create age bins.
- Experience: 47 unique values.
- Income: real numeric values with 162 unique values; income varies by individual.
- ZIPCode: 467 unique values; can be converted in later stages.
- Family: 4 unique values.
- CCAvg: 108 unique values; varies by individual.
- Education: 3 unique values.
- Mortgage: 347 unique values; varies by individual.
- Personal_Loan, Securities_Account, CD_Account, Online, CreditCard: boolean indicators with 0s and 1s.

# Drop the ID column
data.drop("ID", axis=1, inplace=True)
Visualize all features before any data cleanup to understand what needs cleaning and fixing.
Univariate analysis helps to check skewness, possible outliers, and the spread of the data.
Creating a method that can plot a univariate chart with a histplot, boxplot, and bar chart.
## building a Common method to generate graphs
def generate_univariate_chart(data, feature, hue=None, kde=False, bins=20):
"""
Builds histplot and boxplot for given field.
Can plot hue, kde and bins based on params, these are optional columns
"""
sns.set_style("darkgrid")
print(f"Generating Charts for feature : {feature}")
# sns.set_context('poster',font_scale=1)
# figsize(width,height)
fig, axes = plt.subplots(2, figsize=(15, 15))
fig.suptitle("Univariate analysis for " + feature)
sns.histplot(
data=data,
x=feature,
ax=axes[0],
palette="winter",
bins=bins,
kde=kde,
hue=hue,
multiple="dodge",
)
sns.boxplot(
data=data, x=feature, ax=axes[1], showmeans=True, color="violet", hue=hue
)
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
"""
Barplot with percentage at the top
data: dataframe
feature: dataframe column
perc: whether to display percentages instead of count (default is False)
n: displays the top n category levels (default is None, i.e., display all levels)
"""
total = len(data[feature]) # length of the column
count = data[feature].nunique()
if n is None:
plt.figure(figsize=(count + 1, 10))
else:
plt.figure(figsize=(n + 1, 10))
plt.xticks(rotation=90, fontsize=25)
ax = sns.countplot(
data=data,
x=feature,
palette="Paired",
order=data[feature].value_counts().index[:n].sort_values(),
)
for p in ax.patches:
if perc == True:
label = "{:.1f}%".format(
100 * p.get_height() / total
) # percentage of each class of the category
else:
label = p.get_height() # count of each level of the category
x = p.get_x() + p.get_width() / 2 # width of the plot
y = p.get_height() # height of the plot
ax.annotate(
label,
(x, y),
ha="center",
va="center",
size=12,
xytext=(0, 5),
textcoords="offset points",
) # annotate the percentage
plt.show() # show the plot
labeled_barplot(data=data, feature="Personal_Loan", perc=True)
labeled_barplot(data=data, feature="Securities_Account", perc=True)
labeled_barplot(data=data, feature="CD_Account", perc=True)
labeled_barplot(data=data, feature="Online", perc=True)
labeled_barplot(data=data, feature="CreditCard", perc=True)
# with all params
generate_univariate_chart(data=data, feature="Age", hue=None, bins=40, kde=False)
labeled_barplot(data=data, feature="Age", perc=True, n=40)
Generating Charts for feature : Age
# with all params
generate_univariate_chart(data=data, feature="Experience", hue=None, bins=40, kde=False)
Generating Charts for feature : Experience
# with all params
generate_univariate_chart(data=data, feature="Income", hue=None, bins=30, kde=False)
Generating Charts for feature : Income
# with all params
labeled_barplot(data=data, feature="ZIPCode", perc=True, n=30)
# with all params
labeled_barplot(data=data, feature="Family", perc=True, n=10)
# with all params
generate_univariate_chart(data=data, feature="CCAvg", hue=None, bins=30, kde=False)
Generating Charts for feature : CCAvg
labeled_barplot(data=data, feature="Education", perc=True, n=3)
# with all params
generate_univariate_chart(data=data, feature="Mortgage", hue=None, bins=30, kde=False)
Generating Charts for feature : Mortgage
Data Observations
- Age: no outliers, but we can convert it to bins.
- Experience: no outliers, but it has some negative values that need treatment.
- Income: has outliers; we have to examine and scale this data.
- ZIPCode: 467 unique values with no outliers, but we cannot use it as-is, so we will convert it into a city/county or metropolitan feature.
- Family: 4 unique values; no treatment required, we leave this feature as it is.
- CCAvg: 108 unique values that vary by individual; has many outliers, so we will treat the data and apply a scaler.
- Education: 3 unique values; no treatment required, we leave this feature as it is.
- Mortgage: 347 unique values that vary by individual; has many outliers, so we will treat the data and apply a scaler.
- Personal_Loan, Securities_Account, CD_Account, Online, CreditCard: boolean indicators with 0s and 1s; no treatment required.

Data cleaning and feature conversions based on the knowledge gathered from the initial data analysis
No missing values, so no missing-value treatment will be applied.
The Age and ZIPCode features need conversion.
import zipcodes
# is_real is a method in the zipcodes package that returns whether a given ZIP code is valid
# matching returns data for a given ZIP code
## find_county_state_from_zipcode - a common method to derive county and state from a ZIP code
def find_county_state_from_zipcode(zipcode):
"""
Zipcodes is a simple library for querying over U.S. zipcode data.
method will validate input zip code is valid and find county information
"""
zipcode = str(zipcode)
if zipcodes.is_real(zipcode):
zipdata = zipcodes.matching(zipcode)[0]
return zipdata["county"], zipdata["state"]
else:
return None, None
# testing the method with one valid and one invalid ZIP code
# valid zip code
print(
f"valid zip code 30338 county details : {find_county_state_from_zipcode('30338')}"
)
# invalid zip code
print(
f"invalid zip code 3038 county details : {find_county_state_from_zipcode('3038')}"
)
valid zip code 30338 county details : ('DeKalb County', 'GA')
invalid zip code 3038 county details : (None, None)
## add new County and State columns derived from ZIPCode
# data["County"], data["State"] = find_county_state_from_zipcode(data["ZIPCode"])
data.loc[:, ["County", "State"]] = [
find_county_state_from_zipcode(i) for i in data["ZIPCode"]
]
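The list comprehension above calls `zipcodes.matching` once per row (5,000 lookups); since there are only 467 unique ZIP codes, building a lookup table for the unique values first is much faster. A sketch of the pattern with a hypothetical hard-coded lookup so it runs without the `zipcodes` package (in the notebook, the dict would be `{z: find_county_state_from_zipcode(z) for z in data["ZIPCode"].unique()}`):

```python
import pandas as pd

# Hypothetical lookup built once from the unique ZIP codes; in the notebook:
# lookup = {z: find_county_state_from_zipcode(z) for z in data["ZIPCode"].unique()}
lookup = {91107: ("Los Angeles County", "CA"), 94720: ("Alameda County", "CA")}

df = pd.DataFrame({"ZIPCode": [91107, 94720, 91107]})
df["County"] = df["ZIPCode"].map(lambda z: lookup[z][0])  # dict lookups only,
df["State"] = df["ZIPCode"].map(lambda z: lookup[z][1])   # no per-row API calls
print(df["County"].tolist())
```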
## let's look at some sample rows to see how State and County were populated
data.sample(10)
| Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | County | State | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3519 | 31 | 5 | 84 | 94720 | 4 | 1.8 | 2 | 0 | 0 | 1 | 0 | 1 | 0 | Alameda County | CA |
| 1612 | 41 | 17 | 33 | 94550 | 1 | 0.7 | 1 | 104 | 0 | 0 | 0 | 0 | 0 | Alameda County | CA |
| 2394 | 42 | 18 | 145 | 94065 | 2 | 8.0 | 1 | 505 | 0 | 0 | 0 | 0 | 0 | San Mateo County | CA |
| 3122 | 38 | 14 | 54 | 90095 | 2 | 0.6 | 3 | 218 | 0 | 0 | 0 | 0 | 0 | Los Angeles County | CA |
| 1273 | 60 | 35 | 130 | 95741 | 3 | 6.3 | 3 | 437 | 1 | 0 | 1 | 1 | 1 | Sacramento County | CA |
| 3660 | 38 | 12 | 59 | 93401 | 2 | 2.4 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | San Luis Obispo County | CA |
| 3281 | 51 | 25 | 62 | 95014 | 2 | 1.5 | 2 | 0 | 0 | 0 | 0 | 1 | 1 | Santa Clara County | CA |
| 519 | 34 | 9 | 48 | 95616 | 1 | 2.5 | 3 | 105 | 0 | 0 | 0 | 1 | 0 | Yolo County | CA |
| 2229 | 46 | 22 | 72 | 91711 | 4 | 1.4 | 2 | 149 | 0 | 0 | 0 | 1 | 1 | Los Angeles County | CA |
| 1411 | 65 | 39 | 184 | 91302 | 1 | 5.4 | 3 | 176 | 1 | 0 | 1 | 1 | 1 | Los Angeles County | CA |
print_null_info(data)
Total Null value counts

County                34
State                 34
Age                    0
Experience             0
Income                 0
ZIPCode                0
Family                 0
CCAvg                  0
Education              0
Mortgage               0
Personal_Loan          0
Securities_Account     0
CD_Account             0
Online                 0
CreditCard             0
dtype: int64
# print value types and value counts
colsZip = ["ZIPCode", "County", "State"]
print_category_value_counts(data, colsZip)
Column name : ZIPCode has total 467 unique values

94720    169
94305    127
95616    116
90095     71
93106     57
92037     54
93943     54
91320     53
91711     52
94025     52
Name: ZIPCode, dtype: int64

--------------------------------------------------

Column name : County has total 38 unique values

Los Angeles County      1095
San Diego County         568
Santa Clara County       563
Alameda County           500
Orange County            339
San Francisco County     257
San Mateo County         204
Sacramento County        184
Santa Barbara County     154
Yolo County              130
Name: County, dtype: int64

--------------------------------------------------

Column name : State has total 1 unique values

CA    4966
Name: State, dtype: int64

--------------------------------------------------
Fix missing data values
# filter missing county data
data[data["County"].isna()]["ZIPCode"].value_counts()
92717    22
96651     6
92634     5
93077     1
Name: ZIPCode, dtype: int64
We don't have county data for 4 ZIP codes; searching external sources gives:
- 92717 - Orange, CA
- 96651 - Washington, DC
- 92634 - Fullerton, CA
- 93077 - Astoria, OR
# use .loc with a boolean mask directly to avoid chained-assignment issues
data.loc[data["ZIPCode"] == 92717, ["County", "State"]] = ["Orange County", "CA"]
data.loc[data["ZIPCode"] == 96651, ["County", "State"]] = ["San Francisco County", "CA"]
data.loc[data["ZIPCode"] == 92634, ["County", "State"]] = ["Orange County", "CA"]
data.loc[data["ZIPCode"] == 93077, ["County", "State"]] = ["Los Angeles County", "CA"]
# checking whether County still has any missing values
data[data["County"].isna()].size
# Drop the ZIPCode column
data.drop("ZIPCode", axis=1, inplace=True)
# Age Ranges
data["AgeRange"] = pd.cut(
data["Age"],
[-np.inf, 18, 30, 40, 50, 60, 70, np.inf],
labels=["<=18", "19 to 29", "30 to 39", "40 to 49", "50 to 59", "60 to 69", ">=70"],
)
# drop age column
data.drop("Age", axis=1, inplace=True)
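One subtlety worth noting: `pd.cut` includes the right bin edge by default, so with the edges above an age of exactly 30 falls in the (18, 30] interval and gets the "19 to 29" label. A small self-contained demonstration (shift the edges, or pass `right=False`, if half-open [low, high) bins are wanted):

```python
import pandas as pd

ages = pd.Series([18, 29, 30, 31])
edges = [-float("inf"), 18, 30, float("inf")]
labels = ["<=18", "19 to 29", ">=30"]

# Default right=True: intervals are (low, high], so 30 lands in (18, 30]
binned = pd.cut(ages, edges, labels=labels)
print(binned.tolist())  # ['<=18', '19 to 29', '19 to 29', '>=30']
```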
# fixing all negative Experience values by taking the absolute value
data["Experience"] = abs(data["Experience"])
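The `abs()` treatment assumes the negative entries are sign errors rather than missing-value codes; a tiny self-contained sketch verifying the fix leaves no negatives behind:

```python
import pandas as pd

# Toy column with the kind of negative entries seen in Experience
df = pd.DataFrame({"Experience": [-3, -1, 0, 20]})
df["Experience"] = df["Experience"].abs()  # same treatment as above
print(df["Experience"].tolist())  # [3, 1, 0, 20]
```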
# Experience Ranges
data["ExperienceRange"] = pd.cut(
data["Experience"],
[-np.inf, 1, 3, 6, 10, 15, 20, 30, 40, np.inf],
labels=[
"No Experience",
"<3",
"3 to 5",
"6 to 9",
"10 to 14",
"15 to 19",
"20 to 29",
"30 to 39",
">=40",
],
)
# drop Experience column
data.drop("Experience", axis=1, inplace=True)
Visualize all features after data cleanup to understand how they relate to each other and to the target (dependent) feature.
Checking the data types of all columns
# convert all newly added columns as category
cat_vars = ["County", "State", "AgeRange", "ExperienceRange"]
for colname in cat_vars:
data[colname] = data[colname].astype("category")
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype   
---  ------              --------------  -----   
 0   Income              5000 non-null   int64   
 1   Family              5000 non-null   int64   
 2   CCAvg               5000 non-null   float64 
 3   Education           5000 non-null   int64   
 4   Mortgage            5000 non-null   int64   
 5   Personal_Loan       5000 non-null   int64   
 6   Securities_Account  5000 non-null   int64   
 7   CD_Account          5000 non-null   int64   
 8   Online              5000 non-null   int64   
 9   CreditCard          5000 non-null   int64   
 10  County              5000 non-null   category
 11  State               5000 non-null   category
 12  AgeRange            5000 non-null   category
 13  ExperienceRange     5000 non-null   category
dtypes: category(4), float64(1), int64(9)
memory usage: 412.4 KB
data.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| Income | 5000.0 | 73.774200 | 46.033729 | 8.0 | 39.0 | 64.0 | 98.0 | 224.0 |
| Family | 5000.0 | 2.396400 | 1.147663 | 1.0 | 1.0 | 2.0 | 3.0 | 4.0 |
| CCAvg | 5000.0 | 1.937938 | 1.747659 | 0.0 | 0.7 | 1.5 | 2.5 | 10.0 |
| Education | 5000.0 | 1.881000 | 0.839869 | 1.0 | 1.0 | 2.0 | 3.0 | 3.0 |
| Mortgage | 5000.0 | 56.498800 | 101.713802 | 0.0 | 0.0 | 0.0 | 101.0 | 635.0 |
| Personal_Loan | 5000.0 | 0.096000 | 0.294621 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| Securities_Account | 5000.0 | 0.104400 | 0.305809 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| CD_Account | 5000.0 | 0.060400 | 0.238250 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| Online | 5000.0 | 0.596800 | 0.490589 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| CreditCard | 5000.0 | 0.294000 | 0.455637 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
Let's check which columns have null values, and how many.
print_null_info(data)
Total Null value counts

Income                0
Family                0
CCAvg                 0
Education             0
Mortgage              0
Personal_Loan         0
Securities_Account    0
CD_Account            0
Online                0
CreditCard            0
County                0
State                 0
AgeRange              0
ExperienceRange       0
dtype: int64
Let us look at the different values by feature.
# print value types and value counts
cols = [
"AgeRange",
"ExperienceRange",
"County",
"State",
]
print_category_value_counts(data, cols)
Column name : AgeRange has total 5 unique values

50 to 59    1323
40 to 49    1270
30 to 39    1236
19 to 29     624
60 to 69     547
<=18           0
>=70           0
Name: AgeRange, dtype: int64

--------------------------------------------------

Column name : ExperienceRange has total 9 unique values

20 to 29         1301
30 to 39         1103
15 to 19          672
10 to 14          581
6 to 9            505
3 to 5            378
<3                233
No Experience     173
>=40               54
Name: ExperienceRange, dtype: int64

--------------------------------------------------

Column name : County has total 38 unique values

Los Angeles County      1096
San Diego County         568
Santa Clara County       563
Alameda County           500
Orange County            366
San Francisco County     263
San Mateo County         204
Sacramento County        184
Santa Barbara County     154
Yolo County              130
Name: County, dtype: int64

--------------------------------------------------

Column name : State has total 1 unique values

CA    5000
Name: State, dtype: int64

--------------------------------------------------
Checking how each feature is distributed after cleaning and how it relates to the dependent variable.
### function to plot distributions wrt target
def distribution_plot_wrt_target(data, predictor, target):
"""
function to plot distributions wrt target
"""
fig, axs = plt.subplots(2, 2, figsize=(12, 10))
target_uniq = data[target].unique()
axs[0, 0].set_title("Distribution of target for target=" + str(target_uniq[0]))
sns.histplot(
data=data[data[target] == target_uniq[0]],
x=predictor,
kde=True,
ax=axs[0, 0],
color="teal",
stat="density",
)
axs[0, 1].set_title("Distribution of target for target=" + str(target_uniq[1]))
sns.histplot(
data=data[data[target] == target_uniq[1]],
x=predictor,
kde=True,
ax=axs[0, 1],
color="orange",
stat="density",
)
axs[1, 0].set_title("Boxplot w.r.t target")
sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0], palette="gist_rainbow")
axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
sns.boxplot(
data=data,
x=target,
y=predictor,
ax=axs[1, 1],
showfliers=False,
palette="gist_rainbow",
)
plt.tight_layout()
plt.show()
# function to plot stacked bar chart
def stacked_barplot(data, predictor, target):
"""
Print the category counts and plot a stacked bar chart
data: dataframe
predictor: independent variable
target: target variable
"""
count = data[predictor].nunique()
sorter = data[target].value_counts().index[-1]
tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
by=sorter, ascending=False
)
print(tab1)
print("-" * 100)
tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
by=sorter, ascending=False
)
tab.plot(kind="bar", stacked=True, figsize=(count + 5, 6))
plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
plt.show()
## this method generates a joint plot of feature x vs feature y
def generate_bivariate_chart(data, xfeature, yfeature, hue=None):
"""
common method to generate joint plot for various columns
hue param is optional
"""
sns.set_style("darkgrid")
print(f"Generating Charts for feature x : {xfeature}, y : {yfeature}")
sns.jointplot(
data=data,
x=xfeature,
y=yfeature,
palette="winter",
height=10,
kind="scatter",
hue=hue,
)
labeled_barplot(data=data, feature="AgeRange", perc=True, n=10)
stacked_barplot(data, "AgeRange", "Personal_Loan")
Personal_Loan     0    1   All
AgeRange                      
All            4520  480  5000
40 to 49       1148  122  1270
30 to 39       1118  118  1236
50 to 59       1208  115  1323
19 to 29        558   66   624
60 to 69        488   59   547
----------------------------------------------------------------------------------------------------
labeled_barplot(data=data, feature="ExperienceRange", perc=True, n=10)
stacked_barplot(data, "ExperienceRange", "Personal_Loan")
Personal_Loan       0    1   All
ExperienceRange                 
All              4520  480  5000
20 to 29         1182  119  1301
30 to 39         1000  103  1103
15 to 19          605   67   672
6 to 9            448   57   505
10 to 14          530   51   581
3 to 5            343   35   378
<3                207   26   233
No Experience     158   15   173
>=40               47    7    54
----------------------------------------------------------------------------------------------------
labeled_barplot(data=data, feature="County", perc=True, n=40)
stacked_barplot(data, "County", "Personal_Loan")
Personal_Loan              0    1   All
County                                 
All                     4520  480  5000
Los Angeles County       985  111  1096
Santa Clara County       492   71   563
San Diego County         509   59   568
Alameda County           456   44   500
Orange County            333   33   366
San Francisco County     244   19   263
Monterey County          113   15   128
Sacramento County        169   15   184
Contra Costa County       73   12    85
San Mateo County         192   12   204
Ventura County           103   11   114
Santa Barbara County     143   11   154
Santa Cruz County         60    8    68
Yolo County              122    8   130
Kern County               47    7    54
Sonoma County             22    6    28
Marin County              48    6    54
Riverside County          50    6    56
San Luis Obispo County    28    5    33
Solano County             30    3    33
San Bernardino County     98    3   101
Shasta County             15    3    18
Humboldt County           30    2    32
Butte County              17    2    19
Placer County             22    2    24
Fresno County             24    2    26
San Joaquin County        12    1    13
El Dorado County          16    1    17
Mendocino County           7    1     8
Stanislaus County         14    1    15
Imperial County            3    0     3
Napa County                3    0     3
Siskiyou County            7    0     7
Merced County              4    0     4
Trinity County             4    0     4
Tuolumne County            7    0     7
Lake County                4    0     4
San Benito County         14    0    14
----------------------------------------------------------------------------------------------------
Let's see how the features are related to each other and to the target feature.
generate_bivariate_chart(
xfeature="Income", yfeature="CCAvg", data=data, hue="Personal_Loan"
)
Generating Charts for feature x : Income, y : CCAvg
generate_bivariate_chart(
xfeature="Income", yfeature="Mortgage", data=data, hue="Personal_Loan"
)
Generating Charts for feature x : Income, y : Mortgage
generate_bivariate_chart(
xfeature="Mortgage", yfeature="CCAvg", data=data, hue="Personal_Loan"
)
Generating Charts for feature x : Mortgage, y : CCAvg
stacked_barplot(data, "Family", "Personal_Loan")
Personal_Loan     0    1   All
Family                        
All            4520  480  5000
4              1088  134  1222
3               877  133  1010
1              1365  107  1472
2              1190  106  1296
----------------------------------------------------------------------------------------------------
stacked_barplot(data, "Education", "Personal_Loan")
Personal_Loan     0    1   All
Education                     
All            4520  480  5000
3              1296  205  1501
2              1221  182  1403
1              2003   93  2096
----------------------------------------------------------------------------------------------------
stacked_barplot(data, "Securities_Account", "Personal_Loan")
Personal_Loan          0    1   All
Securities_Account                 
All                 4520  480  5000
0                   4058  420  4478
1                    462   60   522
----------------------------------------------------------------------------------------------------
stacked_barplot(data, "CD_Account", "Personal_Loan")
Personal_Loan     0    1   All
CD_Account                    
All            4520  480  5000
0              4358  340  4698
1               162  140   302
----------------------------------------------------------------------------------------------------
stacked_barplot(data, "Online", "Personal_Loan")
Personal_Loan     0    1   All
Online                        
All            4520  480  5000
1              2693  291  2984
0              1827  189  2016
----------------------------------------------------------------------------------------------------
stacked_barplot(data, "CreditCard", "Personal_Loan")
Personal_Loan     0    1   All
CreditCard                    
All            4520  480  5000
0              3193  337  3530
1              1327  143  1470
----------------------------------------------------------------------------------------------------
plt.figure(figsize=(18, 10))
sns.heatmap(data.corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral").set_title("Complete Data Correlation")
plt.show()
plt.figure(figsize=(18, 10))
sns.heatmap(
data[data["Personal_Loan"] == 1].corr(),
annot=True,
vmin=-1,
vmax=1,
fmt=".2f",
cmap="Spectral",
).set_title("Personal Loan Correlation")
plt.show()
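Complementing the heatmaps, sorting the Personal_Loan column of the correlation matrix ranks the numeric features against the target in one line. A sketch on a hypothetical mini-frame (in the notebook this would be applied to `data.corr()["Personal_Loan"]`):

```python
import pandas as pd

# Hypothetical mini-frame standing in for the numeric columns of `data`
df = pd.DataFrame({
    "Income": [49, 34, 100, 180],
    "CCAvg": [1.6, 1.5, 2.7, 8.0],
    "Personal_Loan": [0, 0, 1, 1],
})
corr = (
    df.corr()["Personal_Loan"]
    .drop("Personal_Loan")          # drop the target's self-correlation of 1.0
    .sort_values(ascending=False)   # strongest positive relationship first
)
print(corr)
```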
observations
Personal_Loan - Income, credit card spending (CCAvg), and CD_Account show some relationship with the target; we will examine these relationships further. Customers who took a loan last year also show a good relationship with CD_Account, Securities_Account, Online, and CreditCard.
Income and CCAvg are related, and loan takers show a relationship with Securities_Account, CD_Account, Online, and holding a credit card from another bank.
sns.pairplot(data, hue="Personal_Loan")
plt.show()
observations
Income & CCAvg and Income & Mortgage show a positive relationship; CCAvg & Mortgage show only a scattered relationship.
Data Description:
Data Cleaning:
Observations from EDA:
Income - Income has a positive relationship with CCAvg, and Personal_Loan was offered only above a certain income level. This makes sense: customers without sufficient income cannot repay a loan.
Family - Families of 3 or 4 were offered more loans than families of 1 or 2, possibly because their combined income is higher.
CCAvg - Spending level is related to Income, and both are related to Personal_Loan.
Education - Levels 2 and 3 were offered more loans, likely because customers at those levels earn more income and can take and repay a Personal_Loan.
Mortgage - Mortgage is related to Education: levels 2 and 3 earn more income.
Personal_Loan - Only about 9.6% of customers were offered a loan, so the data is heavily biased toward the not-offered class; we have to handle this imbalance when building the model. The target is related to most features: Income, Family, Education, CD_Account, Securities_Account, County.
Securities_Account - has some relationship with Personal_Loan.
CD_Account - has some relationship with Personal_Loan.
Online - not much relationship with Personal_Loan or other features.
CreditCard - not much relationship with Personal_Loan or other features.
County - some counties were offered more personal loans, driven by income levels in those counties.
State - all customers are in California.
AgeRange - 5 different age-range values; customers mostly between 25 and 40 were offered more loans.
ExperienceRange - higher experience brings higher income, and more loans were offered.
X = data.drop("Personal_Loan", axis=1)
Y = data["Personal_Loan"]
# creating dummy variables
X = pd.get_dummies(X, drop_first=True)
X.shape
(5000, 60)
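For reference, a toy example (hypothetical column name, not from the bank data) of what `drop_first=True` does:

```python
import pandas as pd

# Toy categorical column; with drop_first=True the first level "A"
# becomes the implicit baseline and gets no dummy column of its own.
toy = pd.DataFrame({"Grade": ["A", "B", "C", "B"]})
dummies = pd.get_dummies(toy, drop_first=True)
print(dummies.columns.tolist())  # ['Grade_B', 'Grade_C']
```

Dropping the first level avoids perfectly collinear dummy columns, which matters for logistic regression.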
X.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 60 columns), all 5000 non-null: Income, Family, Education, Mortgage, Securities_Account, CD_Account, Online, CreditCard (int64), CCAvg (float64), plus 51 uint8 dummy columns — 37 County_*, 6 AgeRange_* (19 to 29 through >=70), and 8 ExperienceRange_* (<3 through >=40).
dtypes: float64(1), int64(8), uint8(51)
memory usage: 600.7 KB
Now it's time to do a train test split, and train our model!
Split the data into training set and testing set using train_test_split
# import train_test_split library
from sklearn.model_selection import train_test_split
# splitting in training and test set
X_train, X_test, y_train, y_test = train_test_split(
X, Y, test_size=0.3, random_state=101
)
print("Number of rows in train data =", X_train.shape[0])
print("Number of rows in test data =", X_test.shape[0])
Number of rows in train data = 3500
Number of rows in test data = 1500
print("Percentage of classes in training set")
print(y_train.value_counts(normalize=True) * 100)
print()
print("Percentage of classes in test set")
print(y_test.value_counts(normalize=True) * 100)
Percentage of classes in training set
0    90.457143
1     9.542857
Name: Personal_Loan, dtype: float64

Percentage of classes in test set
0    90.266667
1     9.733333
Name: Personal_Loan, dtype: float64
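Note that `Personal_Loan` is imbalanced (~9.6% positives), so the class ratio can drift slightly between train and test with a plain random split. `train_test_split` supports a `stratify` argument that pins the ratio in both sets; a sketch on toy data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced target (~10% positives) standing in for Personal_Loan
rng = np.random.RandomState(0)
y_toy = (rng.rand(1000) < 0.1).astype(int)
X_toy = rng.rand(1000, 3)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=101, stratify=y_toy
)
# stratify=y_toy makes the positive rate (nearly) identical in both splits
print(round(y_tr.mean(), 3), round(y_te.mean(), 3))
```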
Recall - what proportion of actual positives was identified correctly? It measures the correctly identified positive cases out of all actual positive cases and matters most when the cost of false negatives is high.
Precision - what proportion of positive identifications was actually correct? It measures the correctly identified positive cases out of all predicted positive cases and matters most when the cost of false positives is high.
Accuracy - the proportion of correct predictions out of all predictions. The most intuitive metric, it measures all correctly identified cases and is most useful when all classes are equally important.
F1-score - the harmonic mean of precision and recall; it gives a better picture of the incorrectly classified cases than accuracy, especially on imbalanced data.
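All four metrics can be read off the confusion matrix; a small sketch on toy labels (here TP=3, FN=1, FP=2, TN=2):

```python
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]  # one miss (FN) and two false alarms (FP)

print("Accuracy :", accuracy_score(y_true, y_pred))   # (TP+TN)/total = 5/8
print("Recall   :", recall_score(y_true, y_pred))     # TP/(TP+FN) = 3/4
print("Precision:", precision_score(y_true, y_pred))  # TP/(TP+FP) = 3/5
print("F1       :", f1_score(y_true, y_pred))         # harmonic mean of the two
```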
Both error types matter here: the bank does not want to lose money by lending to customers who cannot repay, nor does it want to miss the interest and repayment revenue from customers who can.
False Negative - a lost opportunity; the bank loses a potential loan customer.
False Positive - costs the bank money, since the customer may not pay back the loan.
F1-score should be maximized: the higher the F1-score, the better the chances of identifying both classes correctly. Recall should be maximized, since the bank wants to reduce false negatives, and precision should be maximized so the bank earns money by lending to eligible customers.
# import libraries
from sklearn.linear_model import LogisticRegression
# To build model for prediction
from sklearn.metrics import (
f1_score,
accuracy_score,
recall_score,
precision_score,
confusion_matrix,
roc_auc_score,
plot_confusion_matrix,
precision_recall_curve,
roc_curve,
)
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn_with_threshold(
model, predictors, target, threshold=0.5, modelname="", datatype=""
):
"""
Function to compute different metrics, based on the threshold specified, to check classification model performance
model: classifier
predictors: independent variables
target: dependent variable
threshold: threshold for classifying the observation as class 1
"""
# predicting using the independent variables
pred_prob = model.predict_proba(predictors)[:, 1]
pred_thres = pred_prob > threshold
pred = np.round(pred_thres)
acc = accuracy_score(target, pred) # to compute Accuracy
recall = recall_score(target, pred) # to compute Recall
precision = precision_score(target, pred) # to compute Precision
f1 = f1_score(target, pred) # to compute F1-score
# creating a dataframe of metrics
df_perf = pd.DataFrame(
{"Model":modelname,"Data":datatype,"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,"Threshold":threshold},
index=[0],
)
return df_perf
# defining a function to plot the confusion_matrix of a classification model built using sklearn
def confusion_matrix_sklearn_with_threshold(model, predictors, target, threshold=0.5):
"""
To plot the confusion_matrix, based on the threshold specified, with percentages
model: classifier
predictors: independent variables
target: dependent variable
threshold: threshold for classifying the observation as class 1
"""
pred_prob = model.predict_proba(predictors)[:, 1]
pred_thres = pred_prob > threshold
y_pred = np.round(pred_thres)
cm = confusion_matrix(target, y_pred)
labels = np.asarray(
[
["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
for item in cm.flatten()
]
).reshape(2, 2)
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=labels, fmt="")
plt.ylabel("True")
plt.xlabel("Predicted")
Train and fit a logistic regression model on the training set.
# using newton-cg solver since it's faster for high-dimensional data
model1 = LogisticRegression(solver="newton-cg", random_state=1)
lg1 = model1.fit(X_train, y_train)
Now predict values for the test data and create a classification report for the model.
# creating confusion matrix
confusion_matrix_sklearn_with_threshold(lg1, X_train, y_train, threshold=0.5)
log_reg_model_train_perf = model_performance_classification_sklearn_with_threshold(
lg1, X_train, y_train, threshold=0.5, modelname="Default", datatype="Train"
)
print("Training performance:")
log_reg_model_train_perf
Training performance:
| | Model | Data | Accuracy | Recall | Precision | F1 | Threshold |
|---|---|---|---|---|---|---|---|
| 0 | Default | Train | 0.954286 | 0.649701 | 0.834615 | 0.73064 | 0.5 |
# creating confusion matrix
confusion_matrix_sklearn_with_threshold(lg1, X_test, y_test)
log_reg_model_test_perf = model_performance_classification_sklearn_with_threshold(
lg1, X_test, y_test, threshold=0.5, modelname="Default", datatype="Test"
)
print("Test set performance:")
log_reg_model_test_perf
Test set performance:
| | Model | Data | Accuracy | Recall | Precision | F1 | Threshold |
|---|---|---|---|---|---|---|---|
| 0 | Default | Test | 0.950667 | 0.575342 | 0.875 | 0.694215 | 0.5 |
logit_roc_auc_train = roc_auc_score(y_train, lg1.predict_proba(X_train)[:, 1])
fpr, tpr, thresholds = roc_curve(y_train, lg1.predict_proba(X_train)[:, 1])
plt.figure(figsize=(7, 5))
plt.plot(fpr, tpr, label="Logistic Regression (area = %0.2f)" % logit_roc_auc_train)
plt.plot([0, 1], [0, 1], "r--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver operating characteristic")
plt.legend(loc="lower right")
plt.show()
Optimal threshold using AUC-ROC curve
# Optimal threshold as per AUC-ROC curve
# The optimal cut off would be where tpr is high and fpr is low
fpr, tpr, thresholds = roc_curve(y_train, lg1.predict_proba(X_train)[:, 1])
optimal_idx = np.argmax(tpr - fpr)
optimal_threshold_auc_roc = thresholds[optimal_idx]
print(f"Optimal threshold value is {optimal_threshold_auc_roc}")
Optimal threshold value is 0.16127516065808437
# creating confusion matrix
confusion_matrix_sklearn_with_threshold(
lg1, X_train, y_train, threshold=optimal_threshold_auc_roc
)
# checking model performance for this model - training
log_reg_model_train_perf_threshold_auc_roc = model_performance_classification_sklearn_with_threshold(
lg1,
X_train,
y_train,
threshold=optimal_threshold_auc_roc,
modelname="Optimal threshold value",
datatype="Train",
)
print("Training performance:")
log_reg_model_train_perf_threshold_auc_roc
Training performance:
| | Model | Data | Accuracy | Recall | Precision | F1 | Threshold |
|---|---|---|---|---|---|---|---|
| 0 | Optimal threshold value | Train | 0.93 | 0.868263 | 0.590631 | 0.70303 | 0.161275 |
# checking model performance for this model - test
log_reg_model_test_perf_threshold_auc_roc = model_performance_classification_sklearn_with_threshold(
lg1,
X_test,
y_test,
threshold=optimal_threshold_auc_roc,
modelname="Optimal threshold value",
datatype="Test",
)
print("Test performance:")
log_reg_model_test_perf_threshold_auc_roc
Test performance:
| | Model | Data | Accuracy | Recall | Precision | F1 | Threshold |
|---|---|---|---|---|---|---|---|
| 0 | Optimal threshold value | Test | 0.916 | 0.80137 | 0.546729 | 0.65 | 0.161275 |
The Logistic Regression model with the optimal AUC-ROC threshold achieves a better recall than the default-threshold model, but its precision and F1-score are lower.
Let's try the Precision-Recall curve to find a better threshold.
y_scores = lg1.predict_proba(X_train)[:, 1]
prec, rec, tre = precision_recall_curve(y_train, y_scores,)
def plot_prec_recall_vs_tresh(precisions, recalls, thresholds):
plt.plot(thresholds, precisions[:-1], "b-", label="precision")
plt.plot(thresholds, recalls[:-1], "g-", label="recall")
plt.xlabel("Threshold")
plt.legend(loc="upper left")
plt.ylim([0, 1])
plt.figure(figsize=(10, 7))
plot_prec_recall_vs_tresh(prec, rec, tre)
plt.show()
# setting the threshold
optimal_threshold_curve = 0.36
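The 0.36 above was read off the plot; as a cross-check, the crossover of the two curves can also be located programmatically. A sketch on toy scores (with the `prec`, `rec`, `tre` arrays computed above, the same `argmin` line applies):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Toy labels/scores standing in for y_train and lg1.predict_proba(X_train)[:, 1]
y_true = np.array([0, 0, 0, 1, 0, 1, 1, 0, 1, 1])
scores = np.array([0.10, 0.20, 0.30, 0.35, 0.40, 0.50, 0.60, 0.65, 0.80, 0.90])

prec, rec, thr = precision_recall_curve(y_true, scores)
# threshold where |precision - recall| is smallest, i.e. the curves' crossover
crossover = thr[np.argmin(np.abs(prec[:-1] - rec[:-1]))]
print(crossover)
```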
# checking model performance for this model
log_reg_model_train_perf_threshold_curve = model_performance_classification_sklearn_with_threshold(
lg1,
X_train,
y_train,
threshold=optimal_threshold_curve,
modelname="Thres Recall vs Preci",
datatype="Train",
)
print("Training performance:")
log_reg_model_train_perf_threshold_curve
Training performance:
| | Model | Data | Accuracy | Recall | Precision | F1 | Threshold |
|---|---|---|---|---|---|---|---|
| 0 | Thres Recall vs Preci | Train | 0.953143 | 0.730539 | 0.767296 | 0.748466 | 0.36 |
# checking model performance for this model
log_reg_model_test_perf_threshold_curve = model_performance_classification_sklearn_with_threshold(
lg1,
X_test,
y_test,
threshold=optimal_threshold_curve,
modelname="Thres Recall vs Preci",
datatype="Test",
)
print("Test performance:")
log_reg_model_test_perf_threshold_curve
Test performance:
| | Model | Data | Accuracy | Recall | Precision | F1 | Threshold |
|---|---|---|---|---|---|---|---|
| 0 | Thres Recall vs Preci | Test | 0.948 | 0.650685 | 0.778689 | 0.708955 | 0.36 |
Let's compare all the results collected so far and see which threshold works best.
# concat all data we collected so far
pd.concat(
[
log_reg_model_train_perf,
log_reg_model_test_perf,
log_reg_model_train_perf_threshold_auc_roc,
log_reg_model_test_perf_threshold_auc_roc,
log_reg_model_train_perf_threshold_curve,
log_reg_model_test_perf_threshold_curve,
],
axis=0,
)
| | Model | Data | Accuracy | Recall | Precision | F1 | Threshold |
|---|---|---|---|---|---|---|---|
| 0 | Default | Train | 0.954286 | 0.649701 | 0.834615 | 0.730640 | 0.500000 |
| 0 | Default | Test | 0.950667 | 0.575342 | 0.875000 | 0.694215 | 0.500000 |
| 0 | Optimal threshold value | Train | 0.930000 | 0.868263 | 0.590631 | 0.703030 | 0.161275 |
| 0 | Optimal threshold value | Test | 0.916000 | 0.801370 | 0.546729 | 0.650000 | 0.161275 |
| 0 | Thres Recall vs Preci | Train | 0.953143 | 0.730539 | 0.767296 | 0.748466 | 0.360000 |
| 0 | Thres Recall vs Preci | Test | 0.948000 | 0.650685 | 0.778689 | 0.708955 | 0.360000 |
observations
Default - recall is low compared to the other settings: the bank would miss many customers who would have taken a loan, losing that opportunity.
Optimal threshold value - recall is high, so few eligible customers are missed, but precision is low: the bank may offer loans to customers who will not take or repay them.
Thres Recall vs Preci - this threshold balances recall and F1: the bank still avoids ineligible customers while reaching the eligible ones.
Next, let's enhance the Thres Recall vs Preci model with forward feature selection using SequentialFeatureSelector and see whether we can find a better model.
# import feature_selection
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
# using X_train & y_train which are already built
featureLG = LogisticRegression(solver="newton-cg", random_state=1)
# Build step forward feature selection
sfs = SFS(
featureLG,
k_features=X_train.shape[1],
forward=True, # k_features denotes the number of features to select
floating=False,
n_jobs=-1,
scoring="f1",
cv=5,
)
# Train SFS with our dataset
sfs = sfs.fit(X_train, y_train)
# Print the results
print("Best f1 score: %.2f" % sfs.k_score_) # k_score_ shows the best score
print()
print(
"Best subset (indices):", sfs.k_feature_idx_
) # k_feature_idx_ shows the index of features
# that yield the best score
print()
print(
"Best subset (corresponding names):", sfs.k_feature_names_
) # k_feature_names_ shows the feature names
Best f1 score: 0.71
Best subset (indices): (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59)
Best subset (corresponding names): ('Income', 'Family', 'CCAvg', 'Education', 'Mortgage', 'Securities_Account', 'CD_Account', 'Online', 'CreditCard', 'County_Butte County', 'County_Contra Costa County', 'County_El Dorado County', 'County_Fresno County', 'County_Humboldt County', 'County_Imperial County', 'County_Kern County', 'County_Lake County', 'County_Los Angeles County', 'County_Marin County', 'County_Mendocino County', 'County_Merced County', 'County_Monterey County', 'County_Napa County', 'County_Orange County', 'County_Placer County', 'County_Riverside County', 'County_Sacramento County', 'County_San Benito County', 'County_San Bernardino County', 'County_San Diego County', 'County_San Francisco County', 'County_San Joaquin County', 'County_San Luis Obispo County', 'County_San Mateo County', 'County_Santa Barbara County', 'County_Santa Clara County', 'County_Santa Cruz County', 'County_Shasta County', 'County_Siskiyou County', 'County_Solano County', 'County_Sonoma County', 'County_Stanislaus County', 'County_Trinity County', 'County_Tuolumne County', 'County_Ventura County', 'County_Yolo County', 'AgeRange_19 to 29', 'AgeRange_30 to 39', 'AgeRange_40 to 49', 'AgeRange_50 to 59', 'AgeRange_60 to 69', 'AgeRange_>=70', 'ExperienceRange_<3', 'ExperienceRange_3 to 5', 'ExperienceRange_6 to 9', 'ExperienceRange_10 to 14', 'ExperienceRange_15 to 19', 'ExperienceRange_20 to 29', 'ExperienceRange_30 to 39', 'ExperienceRange_>=40')
# to plot the performance with addition of each feature
from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs
fig1 = plot_sfs(sfs.get_metric_dict(), kind="std_err", figsize=(15, 5))
plt.title("Sequential Forward Selection")
plt.xticks(rotation=90)
plt.show()
# using X_train & y_train which are already built
featureLG = LogisticRegression(solver="newton-cg", random_state=1)
# Build step forward feature selection
sfs = SFS(
featureLG,
k_features=20,
forward=True, # k_features denotes the number of features to select
floating=False,
n_jobs=-1,
scoring="f1",
cv=5,
)
# Train SFS with our dataset
sfs = sfs.fit(X_train, y_train)
# let us select the features which are important
feat_cols = list(sfs.k_feature_idx_)
print(feat_cols)
[0, 1, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 17, 24, 28, 33, 41, 44, 54, 57]
# let us look at the names of the important features
X_train.columns[feat_cols]
Index(['Income', 'Family', 'Education', 'Mortgage', 'Securities_Account',
'CD_Account', 'Online', 'CreditCard', 'County_Butte County',
'County_Contra Costa County', 'County_El Dorado County',
'County_Fresno County', 'County_Los Angeles County',
'County_Placer County', 'County_San Bernardino County',
'County_San Mateo County', 'County_Stanislaus County',
'County_Ventura County', 'ExperienceRange_6 to 9',
'ExperienceRange_20 to 29'],
dtype='object')
x_train_final = X_train[X_train.columns[feat_cols]]
# Creating new x_test with the same variables that we selected for x_train
x_test_final = X_test[x_train_final.columns]
# using newton-cg solver since it's faster for high-dimensional data
model2 = LogisticRegression(solver="newton-cg", random_state=1)
lg2 = model2.fit(x_train_final, y_train)
# checking model performance for this model
log_reg_model_train_sfs = model_performance_classification_sklearn_with_threshold(
lg2,
x_train_final,
y_train,
threshold=optimal_threshold_curve,
modelname="Top Features from SFS",
datatype="Train",
)
print("Training performance:")
log_reg_model_train_sfs
Training performance:
| | Model | Data | Accuracy | Recall | Precision | F1 | Threshold |
|---|---|---|---|---|---|---|---|
| 0 | Top Features from SFS | Train | 0.949714 | 0.721557 | 0.743827 | 0.732523 | 0.36 |
# checking model performance for this model
log_reg_model_test_sfs = model_performance_classification_sklearn_with_threshold(
lg2,
x_test_final,
y_test,
threshold=optimal_threshold_curve,
modelname="Top Features from SFS",
datatype="Test",
)
print("Test performance:")
log_reg_model_test_sfs
Test performance:
| | Model | Data | Accuracy | Recall | Precision | F1 | Threshold |
|---|---|---|---|---|---|---|---|
| 0 | Top Features from SFS | Test | 0.946667 | 0.657534 | 0.761905 | 0.705882 | 0.36 |
# concat all data we collected so far
pd.concat(
[
log_reg_model_train_perf,
log_reg_model_test_perf,
log_reg_model_train_perf_threshold_auc_roc,
log_reg_model_test_perf_threshold_auc_roc,
log_reg_model_train_perf_threshold_curve,
log_reg_model_test_perf_threshold_curve,
log_reg_model_train_sfs,
log_reg_model_test_sfs,
],
axis=0,
)
| | Model | Data | Accuracy | Recall | Precision | F1 | Threshold |
|---|---|---|---|---|---|---|---|
| 0 | Default | Train | 0.954286 | 0.649701 | 0.834615 | 0.730640 | 0.500000 |
| 0 | Default | Test | 0.950667 | 0.575342 | 0.875000 | 0.694215 | 0.500000 |
| 0 | Optimal threshold value | Train | 0.930000 | 0.868263 | 0.590631 | 0.703030 | 0.161275 |
| 0 | Optimal threshold value | Test | 0.916000 | 0.801370 | 0.546729 | 0.650000 | 0.161275 |
| 0 | Thres Recall vs Preci | Train | 0.953143 | 0.730539 | 0.767296 | 0.748466 | 0.360000 |
| 0 | Thres Recall vs Preci | Test | 0.948000 | 0.650685 | 0.778689 | 0.708955 | 0.360000 |
| 0 | Top Features from SFS | Train | 0.949714 | 0.721557 | 0.743827 | 0.732523 | 0.360000 |
| 0 | Top Features from SFS | Test | 0.946667 | 0.657534 | 0.761905 | 0.705882 | 0.360000 |
Top Features from SFS - this model has good accuracy and F1-score compared to all the models we built, and precision is also high; recall is slightly lower than the Thres Recall vs Preci model.
Thres Recall vs Preci - has the best overall F1-score, recall, and accuracy. This model is the better fit for the bank.
# let us check the coefficients and intercept of the model
coef_df = pd.DataFrame(
np.append(lg1.coef_, lg1.intercept_),
index=X_train.columns.tolist() + ["Intercept"],
columns=["Coefficients"],
)
coef_df.sort_values(by="Coefficients", ascending=False)
| | Coefficients |
|---|---|
| CD_Account | 3.664580 |
| Education | 1.632071 |
| County_Placer County | 1.072655 |
| County_Contra Costa County | 0.880878 |
| County_Marin County | 0.852858 |
| ExperienceRange_6 to 9 | 0.830868 |
| Family | 0.693912 |
| County_Solano County | 0.672501 |
| ExperienceRange_30 to 39 | 0.656381 |
| County_Santa Clara County | 0.583206 |
| County_Butte County | 0.454077 |
| County_Riverside County | 0.423335 |
| County_Orange County | 0.403205 |
| ExperienceRange_20 to 29 | 0.395152 |
| ExperienceRange_10 to 14 | 0.356979 |
| ExperienceRange_15 to 19 | 0.347598 |
| County_San Diego County | 0.327565 |
| ExperienceRange_3 to 5 | 0.233277 |
| AgeRange_60 to 69 | 0.195572 |
| County_Kern County | 0.192941 |
| AgeRange_19 to 29 | 0.189677 |
| County_Santa Barbara County | 0.159532 |
| CCAvg | 0.120601 |
| County_Yolo County | 0.101025 |
| County_Los Angeles County | 0.075227 |
| County_Sonoma County | 0.071277 |
| Income | 0.051984 |
| County_Ventura County | 0.046343 |
| ExperienceRange_<3 | 0.042816 |
| County_Sacramento County | 0.040078 |
| AgeRange_40 to 49 | 0.003486 |
| Mortgage | 0.000956 |
| County_Monterey County | 0.000902 |
| AgeRange_>=70 | 0.000000 |
| County_Tuolumne County | -0.003135 |
| County_Santa Cruz County | -0.004533 |
| County_Napa County | -0.004873 |
| County_Lake County | -0.007359 |
| County_Imperial County | -0.017238 |
| County_Mendocino County | -0.025005 |
| County_San Luis Obispo County | -0.032351 |
| County_Siskiyou County | -0.033472 |
| County_Humboldt County | -0.088930 |
| AgeRange_50 to 59 | -0.115409 |
| County_Trinity County | -0.121223 |
| County_San Francisco County | -0.162251 |
| County_Merced County | -0.173524 |
| County_Fresno County | -0.184070 |
| County_El Dorado County | -0.200084 |
| County_San Benito County | -0.205821 |
| County_Shasta County | -0.236881 |
| AgeRange_30 to 39 | -0.273329 |
| ExperienceRange_>=40 | -0.317544 |
| County_San Joaquin County | -0.329864 |
| County_Stanislaus County | -0.466314 |
| Online | -0.655996 |
| Securities_Account | -0.736588 |
| County_San Bernardino County | -0.802730 |
| County_San Mateo County | -0.896480 |
| CreditCard | -0.983868 |
| Intercept | -13.470336 |
Important Coefficient interpretations
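The coefficients above are in log-odds units. A common way to read them is to exponentiate into odds ratios — the multiplicative change in the odds of taking a loan per one-unit increase in a feature. A minimal sketch using a few rounded values from the table above:

```python
import numpy as np
import pandas as pd

# A few rounded coefficients from the table above
coef_toy = pd.DataFrame(
    {"Coefficients": [3.66, 1.63, -0.98]},
    index=["CD_Account", "Education", "CreditCard"],
)
odds = np.exp(coef_toy["Coefficients"])
print(odds.round(2))
# exp(3.66) ~ 39: holding other features fixed, having a CD account
# multiplies the odds of taking a personal loan by roughly 39;
# exp(-0.98) ~ 0.38: holding a credit card from another bank cuts the odds
```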
#import DecisionTree & GridSearchCV
# Libraries to build decision tree classifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
# To tune different models
from sklearn.model_selection import GridSearchCV
# To get diferent metric scores
from sklearn.metrics import (
f1_score,
accuracy_score,
recall_score,
precision_score,
confusion_matrix,
plot_confusion_matrix,
make_scorer,
)
modelDM1 = DecisionTreeClassifier(criterion="gini", random_state=1)
modelDM1.fit(X_train, y_train)
DecisionTreeClassifier(random_state=1)
# creating confusion matrix
confusion_matrix_sklearn_with_threshold(modelDM1, X_train, y_train, threshold=0)
decision_tree_perf_train = model_performance_classification_sklearn_with_threshold(
modelDM1, X_train, y_train, threshold=0, modelname="Default", datatype="Train",
)
decision_tree_perf_train
| | Model | Data | Accuracy | Recall | Precision | F1 | Threshold |
|---|---|---|---|---|---|---|---|
| 0 | Default | Train | 1.0 | 1.0 | 1.0 | 1.0 | 0 |
# creating confusion matrix
confusion_matrix_sklearn_with_threshold(modelDM1, X_test, y_test, threshold=0)
decision_tree_perf_test = model_performance_classification_sklearn_with_threshold(
modelDM1, X_test, y_test, threshold=0, modelname="Default", datatype="Test",
)
decision_tree_perf_test
| | Model | Data | Accuracy | Recall | Precision | F1 | Threshold |
|---|---|---|---|---|---|---|---|
| 0 | Default | Test | 0.983333 | 0.883562 | 0.941606 | 0.911661 | 0 |
observation on default model performance: the tree fits the training data perfectly (all train metrics are 1.0), a clear sign of overfitting; the test scores, while high, drop noticeably, especially recall.
## creating a list of column names
feature_names = X_train.columns.to_list()
plt.figure(figsize=(20, 30))
out = tree.plot_tree(
modelDM1,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=False,
class_names=None,
)
# below code will add arrows to the decision tree split if they are missing
for o in out:
arrow = o.arrow_patch
if arrow is not None:
arrow.set_edgecolor("black")
arrow.set_linewidth(1)
plt.show()
# Text report showing the rules of a decision tree -
print(tree.export_text(modelDM1, feature_names=feature_names, show_weights=True))
|--- Income <= 114.50 | |--- CCAvg <= 2.95 | | |--- Income <= 106.50 | | | |--- weights: [2557.00, 0.00] class: 0 | | |--- Income > 106.50 | | | |--- County_Sonoma County <= 0.50 | | | | |--- CreditCard <= 0.50 | | | | | |--- Education <= 2.50 | | | | | | |--- County_Sacramento County <= 0.50 | | | | | | | |--- CCAvg <= 0.25 | | | | | | | | |--- Family <= 3.00 | | | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | | | | |--- Family > 3.00 | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | |--- CCAvg > 0.25 | | | | | | | | |--- AgeRange_19 to 29 <= 0.50 | | | | | | | | | |--- weights: [27.00, 0.00] class: 0 | | | | | | | | |--- AgeRange_19 to 29 > 0.50 | | | | | | | | | |--- CCAvg <= 1.70 | | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | | | |--- CCAvg > 1.70 | | | | | | | | | | |--- weights: [3.00, 0.00] class: 0 | | | | | | |--- County_Sacramento County > 0.50 | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | |--- Education > 2.50 | | | | | | |--- AgeRange_40 to 49 <= 0.50 | | | | | | | |--- County_Contra Costa County <= 0.50 | | | | | | | | |--- Mortgage <= 320.50 | | | | | | | | | |--- Income <= 108.50 | | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | | | |--- Income > 108.50 | | | | | | | | | | |--- weights: [13.00, 0.00] class: 0 | | | | | | | | |--- Mortgage > 320.50 | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | |--- County_Contra Costa County > 0.50 | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | |--- AgeRange_40 to 49 > 0.50 | | | | | | | |--- weights: [0.00, 3.00] class: 1 | | | | |--- CreditCard > 0.50 | | | | | |--- weights: [22.00, 0.00] class: 0 | | | |--- County_Sonoma County > 0.50 | | | | |--- weights: [0.00, 1.00] class: 1 | |--- CCAvg > 2.95 | | |--- CD_Account <= 0.50 | | | |--- Income <= 94.50 | | | | |--- Online <= 0.50 | | | | | |--- Income <= 82.50 | | | | | | |--- County_San Luis Obispo County <= 0.50 | | | | | | | |--- 
(… remainder of the tree-rules text output, flattened on export and omitted here for readability. The deeper branches split mainly on Income, Education, Family, CCAvg, CD_Account, Mortgage, and Online, plus a few County/AgeRange/ExperienceRange dummies, and end in pure leaves such as `weights: [378.00, 0.00] class: 0` and `weights: [0.00, 222.00] class: 1`.)
# importance of features in the tree building ( The importance of a feature is computed as the
# (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance )
print(
    pd.DataFrame(
        modelDM1.feature_importances_, columns=["Imp"], index=X_train.columns
    ).sort_values(by="Imp", ascending=False)
)
                                    Imp
Education                      0.410314
Income                         0.303044
Family                         0.145148
CCAvg                          0.062608
CD_Account                     0.026906
AgeRange_60 to 69              0.007368
Mortgage                       0.006453
AgeRange_40 to 49              0.005520
County_Santa Clara County      0.004854
CreditCard                     0.004628
ExperienceRange_30 to 39       0.003276
County_San Luis Obispo County  0.003051
County_Sacramento County       0.002835
County_Riverside County        0.002758
County_Sonoma County           0.002529
AgeRange_50 to 59              0.002482
County_Contra Costa County     0.002331
Online                         0.001607
ExperienceRange_6 to 9         0.001565
AgeRange_19 to 29              0.000721
(all remaining County_*, AgeRange_*, ExperienceRange_*, and Securities_Account dummies: 0.000000)
importances = modelDM1.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(10, 20))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
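The Gini importances above always sum to 1, so every feature with zero importance was simply never used in a split. A minimal, self-contained sketch (on synthetic data, not the bank dataset) illustrating this property and how to keep only the informative features:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the bank data: 6 features, only 3 informative
X_toy, y_toy = make_classification(
    n_samples=500, n_features=6, n_informative=3, n_redundant=0, random_state=1
)
toy_tree = DecisionTreeClassifier(random_state=1).fit(X_toy, y_toy)

imp = pd.Series(toy_tree.feature_importances_, index=[f"f{i}" for i in range(6)])
print(imp.sort_values(ascending=False))
print("sum of importances:", imp.sum())  # sums to 1.0 for a fitted tree

# Keep only the features that actually appear in a split
informative = imp[imp > 0].index.tolist()
print("features used in splits:", informative)
```

The same `imp[imp > 0]` filter applied to `modelDM1` would drop the long tail of zero-importance county and experience dummies from the printout above.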
Observations: Education (0.41), Income (0.30), Family (0.15), and CCAvg (0.06) dominate the default tree's importance ranking; most county, age-range, and experience-range dummies contribute nothing.
# Choose the type of classifier.
estimator = DecisionTreeClassifier(random_state=1, class_weight={0: 0.05, 1: 0.95})
# Grid of parameters to choose from
parameters = {
    "max_depth": [5, 10, 15, 20, 25, None],
    "criterion": ["entropy", "gini"],
    "splitter": ["best", "random"],
    "min_impurity_decrease": [0.00001, 0.0001, 0.01],
}
# Type of scoring used to compare parameter combinations
scorer = make_scorer(f1_score)
# Run the grid search
grid_obj = GridSearchCV(estimator, parameters, scoring=scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
estimator = grid_obj.best_estimator_
# Fit the best algorithm to the data.
estimator.fit(X_train, y_train)
DecisionTreeClassifier(class_weight={0: 0.05, 1: 0.95}, criterion='entropy',
max_depth=15, min_impurity_decrease=1e-05,
random_state=1)
estimator.criterion
'entropy'
estimator.splitter
'best'
Observations: grid search selected the entropy criterion with the "best" splitter, max_depth=15, and min_impurity_decrease=1e-05.
# creating confusion matrix
confusion_matrix_sklearn_with_threshold(estimator, X_train, y_train, threshold=0)
decision_tree_perf_train_HP = model_performance_classification_sklearn_with_threshold(
    estimator,
    X_train,
    y_train,
    threshold=0,
    modelname="Hyperparameter tuning",
    datatype="Train",
)
decision_tree_perf_train_HP
|   | Model | Data | Accuracy | Recall | Precision | F1 | Threshold |
|---|---|---|---|---|---|---|---|
| 0 | Hyperparameter tuning | Train | 1.0 | 1.0 | 1.0 | 1.0 | 0 |
# creating confusion matrix
confusion_matrix_sklearn_with_threshold(estimator, X_test, y_test, threshold=0)
decision_tree_perf_test_HP = model_performance_classification_sklearn_with_threshold(
    estimator,
    X_test,
    y_test,
    threshold=0,
    modelname="Hyperparameter tuning",
    datatype="Test",
)
decision_tree_perf_test_HP
|   | Model | Data | Accuracy | Recall | Precision | F1 | Threshold |
|---|---|---|---|---|---|---|---|
| 0 | Hyperparameter tuning | Test | 0.975333 | 0.856164 | 0.886525 | 0.87108 | 0 |
Observations: the tuned tree still fits the training data perfectly (all train metrics at 1.0) while test recall drops to ~0.86, so it is still overfitting.
plt.figure(figsize=(15, 10))
out = tree.plot_tree(
estimator,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=False,
class_names=None,
)
for o in out:
arrow = o.arrow_patch
if arrow is not None:
arrow.set_edgecolor("black")
arrow.set_linewidth(1)
plt.show()
# importance of features in the tree building ( The importance of a feature is computed as the
# (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance )
print(
    pd.DataFrame(
        estimator.feature_importances_, columns=["Imp"], index=X_train.columns
    ).sort_values(by="Imp", ascending=False)
)
                                    Imp
Income                         0.569989
CCAvg                          0.183819
Family                         0.109047
Education                      0.091249
CD_Account                     0.009134
Mortgage                       0.007573
ExperienceRange_20 to 29       0.005979
Online                         0.005454
County_San Diego County        0.004541
ExperienceRange_10 to 14       0.002587
AgeRange_50 to 59              0.001610
ExperienceRange_30 to 39       0.001371
ExperienceRange_3 to 5         0.001314
County_Monterey County         0.001267
ExperienceRange_6 to 9         0.001155
CreditCard                     0.001099
County_San Bernardino County   0.000926
AgeRange_60 to 69              0.000656
County_Santa Barbara County    0.000622
County_Napa County             0.000609
(all remaining County_*, AgeRange_*, ExperienceRange_*, and Securities_Account dummies: 0.000000)
importances = estimator.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(10, 20))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
Observations on important features after tuning: Income (0.57) now dominates, followed by CCAvg, Family, and Education; most dummy features remain unused.
clf = DecisionTreeClassifier(random_state=1, class_weight={0: 0.05, 1: 0.95})
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities
pd.DataFrame(path)
|   | ccp_alphas | impurities |
|---|---|---|
| 0 | 0.000000e+00 | -2.068596e-16 |
| 1 | 3.268108e-19 | -2.065328e-16 |
| 2 | 4.902162e-19 | -2.060425e-16 |
| 3 | 9.337452e-19 | -2.051088e-16 |
| 4 | 1.330587e-18 | -2.037782e-16 |
| 5 | 1.330587e-18 | -2.024476e-16 |
| 6 | 1.330587e-18 | -2.011170e-16 |
| 7 | 5.625815e-18 | -1.954912e-16 |
| 8 | 6.022656e-18 | -1.894686e-16 |
| 9 | 9.139031e-18 | -1.803295e-16 |
| 10 | 1.264058e-17 | -1.676890e-16 |
| 11 | 1.447305e-17 | -1.532159e-16 |
| 12 | 1.887332e-17 | -1.343426e-16 |
| 13 | 2.987985e-17 | -1.044627e-16 |
| 14 | 2.987985e-17 | -7.458290e-17 |
| 15 | 4.121318e-17 | -3.336972e-17 |
| 16 | 7.918159e-17 | 4.581187e-17 |
| 17 | 8.991966e-17 | 1.357315e-16 |
| 18 | 1.757215e-14 | 1.770788e-14 |
| 19 | 1.040353e-04 | 2.080705e-04 |
| 20 | 1.395063e-04 | 6.265895e-04 |
| 21 | 2.037804e-04 | 8.303699e-04 |
| 22 | 2.059255e-04 | 1.242221e-03 |
| 23 | 2.066355e-04 | 1.448856e-03 |
| 24 | 2.080705e-04 | 1.656927e-03 |
| 25 | 2.084324e-04 | 1.865359e-03 |
| 26 | 2.088865e-04 | 2.074246e-03 |
| 27 | 2.109566e-04 | 2.496159e-03 |
| 28 | 2.246159e-04 | 3.843854e-03 |
| 29 | 2.575254e-04 | 4.873956e-03 |
| 30 | 3.926075e-04 | 5.266563e-03 |
| 31 | 5.104228e-04 | 5.776986e-03 |
| 32 | 5.951627e-04 | 8.157637e-03 |
| 33 | 7.759174e-04 | 8.933554e-03 |
| 34 | 8.089361e-04 | 9.742491e-03 |
| 35 | 8.255121e-04 | 1.056800e-02 |
| 36 | 9.979464e-04 | 1.356184e-02 |
| 37 | 1.021427e-03 | 1.866898e-02 |
| 38 | 1.139587e-03 | 1.980857e-02 |
| 39 | 1.173346e-03 | 2.098191e-02 |
| 40 | 1.370637e-03 | 2.509382e-02 |
| 41 | 1.548130e-03 | 2.664195e-02 |
| 42 | 1.661002e-03 | 2.830296e-02 |
| 43 | 2.960159e-03 | 3.126311e-02 |
| 44 | 3.138425e-03 | 3.440154e-02 |
| 45 | 3.908601e-03 | 4.612734e-02 |
| 46 | 4.470367e-03 | 5.059771e-02 |
| 47 | 1.056439e-02 | 6.116210e-02 |
| 48 | 3.525701e-02 | 1.316761e-01 |
| 49 | 5.118268e-02 | 1.828588e-01 |
| 50 | 2.612581e-01 | 4.441169e-01 |
fig, ax = plt.subplots(figsize=(15, 5))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker="o", drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set")
plt.show()
clfs = []
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(
        random_state=1, ccp_alpha=ccp_alpha, class_weight={0: 0.05, 1: 0.95}
    )
    clf.fit(X_train, y_train)
    clfs.append(clf)
print(
    "Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
        clfs[-1].tree_.node_count, ccp_alphas[-1]
    )
)
Number of nodes in the last tree is: 1 with ccp_alpha: 0.2612580922067558
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]
node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]
fig, ax = plt.subplots(2, 1, figsize=(10, 7))
ax[0].plot(ccp_alphas, node_counts, marker="o", drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")
ax[1].plot(ccp_alphas, depth, marker="o", drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")
fig.tight_layout()
recall_train = []
for clf in clfs:
    pred_train = clf.predict(X_train)
    values_train = recall_score(y_train, pred_train)
    recall_train.append(values_train)

f1_train = []
for clf in clfs:
    pred_train = clf.predict(X_train)
    values_train = f1_score(y_train, pred_train)
    f1_train.append(values_train)

recall_test = []
for clf in clfs:
    pred_test = clf.predict(X_test)
    values_test = recall_score(y_test, pred_test)
    recall_test.append(values_test)

f1_test = []
for clf in clfs:
    pred_test = clf.predict(X_test)
    values_test = f1_score(y_test, pred_test)
    f1_test.append(values_test)
fig, ax = plt.subplots(figsize=(15, 10))
ax.set_xlabel("alpha")
ax.set_ylabel("Score")
ax.set_title("F1 & Recall vs alpha for training and testing sets")
ax.plot(ccp_alphas, f1_train, marker="o", label="F1 train", drawstyle="steps-post")
ax.plot(ccp_alphas, f1_test, marker="o", label="F1 test", drawstyle="steps-post")
ax.plot(
ccp_alphas, recall_train, marker="o", label="Recall train", drawstyle="steps-post"
)
ax.plot(
ccp_alphas, recall_test, marker="o", label="Recall test", drawstyle="steps-post"
)
ax.legend()
plt.show()
# selecting the pruned tree with the highest test F1 score
index_best_model = np.argmax(f1_test)
best_model = clfs[index_best_model]
print(best_model)
DecisionTreeClassifier(ccp_alpha=0.00020807050742920554,
class_weight={0: 0.05, 1: 0.95}, random_state=1)
# creating confusion matrix
confusion_matrix_sklearn_with_threshold(best_model, X_train, y_train, threshold=0)
decision_tree_postpruned_perf_train = model_performance_classification_sklearn_with_threshold(
    best_model,
    X_train,
    y_train,
    threshold=0,
    modelname="Post pruned tuning",
    datatype="Train",
)
decision_tree_postpruned_perf_train
|   | Model | Data | Accuracy | Recall | Precision | F1 | Threshold |
|---|---|---|---|---|---|---|---|
| 0 | Post pruned tuning | Train | 0.997714 | 1.0 | 0.976608 | 0.988166 | 0 |
# creating confusion matrix
confusion_matrix_sklearn_with_threshold(best_model, X_test, y_test, threshold=0)
decision_tree_postpruned_perf_test = model_performance_classification_sklearn_with_threshold(
    best_model,
    X_test,
    y_test,
    threshold=0,
    modelname="Post pruned tuning",
    datatype="Test",
)
decision_tree_postpruned_perf_test
|   | Model | Data | Accuracy | Recall | Precision | F1 | Threshold |
|---|---|---|---|---|---|---|---|
| 0 | Post pruned tuning | Test | 0.974667 | 0.890411 | 0.855263 | 0.872483 | 0 |
# concatenate all the performance results collected so far
pd.concat(
[
decision_tree_perf_train,
decision_tree_perf_test,
decision_tree_perf_train_HP,
decision_tree_perf_test_HP,
decision_tree_postpruned_perf_train,
decision_tree_postpruned_perf_test,
],
axis=0,
)
|   | Model | Data | Accuracy | Recall | Precision | F1 | Threshold |
|---|---|---|---|---|---|---|---|
| 0 | Default | Train | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 0 |
| 0 | Default | Test | 0.983333 | 0.883562 | 0.941606 | 0.911661 | 0 |
| 0 | Hyperparameter tuning | Train | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 0 |
| 0 | Hyperparameter tuning | Test | 0.975333 | 0.856164 | 0.886525 | 0.871080 | 0 |
| 0 | Post pruned tuning | Train | 0.997714 | 1.000000 | 0.976608 | 0.988166 | 0 |
| 0 | Post pruned tuning | Test | 0.974667 | 0.890411 | 0.855263 | 0.872483 | 0 |
The default decision tree's recall/F1 scores do not match between train and test, and since the data is heavily biased towards class 0, the accuracy score cannot be trusted on its own. We should not use this model.
The pre-pruned (hyperparameter-tuned) and post-pruned models give better, comparable results. The pre-pruned model has lower recall, which means it misses more customers who would actually purchase the loan.
The post-pruned model shows better recall and F1 score on the test set.
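Why accuracy alone is untrustworthy here can be shown with a small sketch (synthetic labels mimicking the ~9% conversion rate, not the actual predictions): a classifier that predicts "no loan" for every customer still scores around 91% accuracy while having zero recall.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score, f1_score

# Imbalanced labels roughly matching the ~9% loan conversion rate
rng = np.random.default_rng(1)
y_true = (rng.random(1000) < 0.09).astype(int)

# A useless model that always predicts "no loan"
y_pred = np.zeros_like(y_true)

print("accuracy:", accuracy_score(y_true, y_pred))  # high, despite learning nothing
print("recall:  ", recall_score(y_true, y_pred))    # 0.0 - misses every buyer
print("f1:      ", f1_score(y_true, y_pred))        # 0.0
```

This is why the comparison above leans on recall and F1 rather than accuracy when choosing between the pruned models.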
plt.figure(figsize=(15, 10))
out = tree.plot_tree(
best_model,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=False,
class_names=None,
)
for o in out:
arrow = o.arrow_patch
if arrow is not None:
arrow.set_edgecolor("black")
arrow.set_linewidth(1)
plt.show()
# importance of features in the tree building ( The importance of a feature is computed as the
# (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance )
print(
    pd.DataFrame(
        best_model.feature_importances_, columns=["Imp"], index=X_train.columns
    ).sort_values(by="Imp", ascending=False)
)
                                    Imp
Income                         0.632873
CCAvg                          0.160281
Family                         0.119657
Education                      0.055012
CD_Account                     0.006184
Mortgage                       0.004211
Online                         0.004188
ExperienceRange_10 to 14       0.003499
County_San Diego County        0.003237
County_Orange County           0.002652
AgeRange_30 to 39              0.001332
County_San Luis Obispo County  0.001231
ExperienceRange_30 to 39       0.001154
ExperienceRange_6 to 9         0.000937
AgeRange_60 to 69              0.000918
County_Yolo County             0.000874
County_Kern County             0.000471
County_San Bernardino County   0.000461
County_Contra Costa County     0.000424
County_Santa Clara County      0.000403
(all remaining County_*, AgeRange_*, ExperienceRange_*, and Securities_Account dummies: 0.000000)
importances = best_model.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(10, 20))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
The bank should focus on the most influential features: customers with a good income, a high education level, larger families, decent credit card spending, and a CD account with the bank.
The bank should reach out to customers with a good income and education level 2 or 3.
The bank should reach out to customers with 3-4 family members and a good household income.
The bank should be cautious with customers older than 50 or with around 40 years of experience, as they are close to retirement and carry a higher loan-repayment risk.
AllLife Bank should keep looking for more potential customers to convert into personal loan customers: last year's campaign converted about 9% of customers, and there is a much larger pool of potential customers.
In summary, the segments to target are: customers with decent credit usage (likely to repay the Personal_Loan promptly); education level 2 or 3 with good income; families of 3 or 4 with good income; customers holding a CD account; and customers with high CCAvg usage. Customers aged over 50 or with more than 40 years of experience (close to retirement, so repayment is riskier) and customers with income below 50K should be deprioritized.
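These recommendations could be turned into a simple screening filter for the marketing list. The sketch below uses a small hypothetical `customers` table (the column names follow the dataset, but the rows and the exact thresholds of Income > 100 and Age < 50 are illustrative assumptions, not outputs of the model):

```python
import pandas as pd

# Hypothetical stand-in for the bank's customer table
customers = pd.DataFrame(
    {
        "Income": [45, 120, 130, 95, 160],
        "Education": [1, 2, 3, 1, 3],
        "Family": [2, 4, 3, 1, 4],
        "CD_Account": [0, 1, 0, 0, 1],
        "CCAvg": [0.5, 4.2, 3.8, 1.0, 5.0],
        "Age": [62, 38, 45, 55, 41],
    }
)

# Target segment suggested by the tree: good income, education 2/3,
# family of 3-4, and not close to retirement age
target = customers[
    (customers["Income"] > 100)
    & (customers["Education"].isin([2, 3]))
    & (customers["Family"].between(3, 4))
    & (customers["Age"] < 50)
]
print(target)
```

In practice the thresholds would be read off the fitted tree's actual split points (e.g. the Income splits near 94.5 and 114.5 visible in the rules dump) rather than chosen by hand.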